Best practices to evaluate the impact of biomedical research software—metric collection beyond citations

Abstract

Motivation: Software is vital for the advancement of biology and medicine. Impact evaluations of scientific software have primarily emphasized traditional citation metrics of associated papers, despite these metrics inadequately capturing the dynamic picture of impact and despite challenges with improper citation.

Results: To understand how software developers evaluate their tools, we conducted a survey of participants in the Informatics Technology for Cancer Research (ITCR) program funded by the National Cancer Institute (NCI). We found that although developers realize the value of more extensive metric collection, they find a lack of funding and time hindering. We also investigated software among this community to assess how often infrastructure that supports more nontraditional metrics was implemented and how this affected rates of papers describing usage of the software. We found that infrastructure such as a social media presence, more in-depth documentation, the presence of software health metrics, and clear information on how to contact developers appeared to be associated with increased mention rates. Analysing more diverse metrics can enable developers to better understand user engagement, justify continued funding, identify novel use cases, pinpoint improvement areas, and ultimately amplify their software's impact. These metrics bring challenges of their own, including distorted or misleading values as well as ethical and security concerns. More attention to the nuances involved in capturing impact across the spectrum of biomedical software is needed. For funders and developers, we outline guidance based on experience from our community. By reconsidering how we evaluate software, we can empower developers to create tools that more effectively accelerate biological and medical research progress.

Availability and implementation: More information about the analysis, as well as access to data and code, is available at https://github.com/fhdsl/ITCR_Metrics_manuscript_website.


Introduction
Biomedical software has become a critical component of biomedical research, enabling major advances in medicine. Often such software is initially developed so that the developers can use it themselves and is then used by others for research (Bitzer et al. 2007). However, the life span of biomedical software projects is often cut short because maintenance and continued evaluation are not prioritized by funding institutions or promotion committees (Prlić and Procter 2012). Ultimately, the academic infrastructures built around manuscripts date from a time before software and the internet, and they result in an inefficient ecosystem that rewards new software but does not invest in software that has already been built. This revolving door undermines the impact that software projects can have on biomedical research and, ultimately, healthcare. Prioritizing metric collection beyond citations can help funders, promotion committees, and developers better understand the impact and challenges of software projects (Waller 2018).
To understand current practices and challenges of software developers, we performed a survey of participants in the Informatics Technology for Cancer Research (ITCR) program funded by the National Cancer Institute (NCI). We also manually investigated software among this community to assess how often infrastructure that supports evaluations is implemented and how this impacts rates of papers describing usage of the software. We find that developers recognize the utility of analysing software usage, but struggle to find the time or funding for such analyses. Recognizing the significance of comprehensive software metrics, and providing dedicated funding for developers to robustly collect and analyse such data, would enable biomedical software and the research it supports to achieve drastically greater real-world impact.

Citations alone are not enough
Software with impact is not necessarily highly cited. A study of 4971 academic biomedical and economics articles found that software citations included version information only 28% of the time (Howison and Bullard 2016). Another study evaluating 90 biology articles found that version information was included only 27% of the time and URL information only 17% of the time (Du et al. 2021). Specifically, among ITCR-funded software examples, users may forget to cite a tool for visualization, such as the UCSC Xena Genome Browser (Goldman et al. 2020). Users might also forget to cite tools used in the initial phases of a project, such as EMERSE (Electronic Medical Record Search Engine) (Hanauer et al. 2015), which helps identify patient cohorts. Tools which provide access to other software may also not be cited. Examples include Bioconductor (Huber et al. 2015), Gene Pattern Notebook (Reich et al. 2017), and Galaxy (The Galaxy Community 2022). Understanding system-level tool usage may require looking at individual tools on these platforms. Finally, researchers often only describe a tool without citing it and can do so in unusual locations within manuscripts, such as a figure legend.
Another challenge is that manuscripts for software are a snapshot and do not reflect the evolving nature of the software. Typically, it is much easier to publish manuscripts for new software. However, researchers can save time if they can continue working with tools they are already familiar with.
A new type of manuscript for software updates has been proposed (Merow et al. 2023). This could reward developers who start working on software after the initial publication, and provide new ways for funding agencies and others to better recognize software maintenance.

Appropriate use of metrics is the way forward
Metrics beyond citations can be very powerful for the continued evaluation and improvement of software. Table 1 explains the benefits of software evaluations for developers, including identifying ways to optimize the tool, guiding future work, garnering funding support, enhancing user commitment, and motivating community development. Citations alone are insufficient for capturing the dynamic nature of scientific software usage, and they are inadequate for guiding developers to improve their tools.
Evaluation metrics for the purpose of continued development can include the number of new users, returning users, and total downloads of the software, but the types of possible metrics vary based on the type of tool and context (see Table 2). These and other metrics can allow assessment of the rate of establishment within a community. Proper metrics should not only examine the software's performance but also assess whether the motivations and goals of its users are being met. Ideally, metrics also help gather information about the downstream impact of the tool on biomedical research.
Despite these strengths, developers and funders must understand the challenges and nuances in interpreting these metrics. Communities like CHAOSS (Community Health Analytics in Open Source Software) have focused on the proper collection and evaluation of, and standards for, software metrics (https://chaoss.community/). In an effort to have more expansive metrics adopted by the biomedical research community, we aim to provide guidance for evaluations of software impact and engagement. We also discuss ethical considerations and challenges of such evaluations that still require solutions. The guidance presented here holds the potential for developers to improve the use and utility of their tools, improve their chances of funding for future development, and ultimately lead to the development of even more useful software to advance science and medicine (Wratten et al. 2021).

Materials and methods
We performed two analyses to get a sense of software evaluation within the community of developers of the ITCR program funded by the NCI. Our first analysis surveyed developers to better understand how they think about software evaluation. Our second aimed to determine what infrastructure is often implemented to support software evaluation and whether such implementation was associated with the frequency of papers describing usage of the software (see Supplemental Note S1).
In the first analysis, we surveyed 48 ITCR participants. Limited time (68% of respondents) and funding (57% of respondents) were major barriers to performing software impact evaluations (respondents could select multiple barriers). Although a few funding mechanisms support the maintenance and analysis of software (as opposed to the creation of new software), such as the ITCR sustainment awards (Kibbe et al. 2017, Warner and Klemm 2020) or the Essential Open Source Software for Science program of the Chan Zuckerberg Initiative (Science 2019), more funding for software sustainability is needed than is currently available. Awareness of this need was also demonstrated by the recent Declaration on Funding Research Software Sustainability by the Research Software Alliance (ReSA) (Barker et al. 2023). While scientific software has become critical to most researchers, the funding to support the maintenance of such software does not reflect the current level of usage (Siepel 2019). The next most common barriers were privacy concerns (38% of respondents), technical issues (32% of respondents), and not knowing what methods to use for evaluations (27% of respondents). Despite these apparent challenges, 73% of respondents stated that such evaluations had informed new development ideas, 60% stated that they had informed documentation, and 54% stated that they had helped justify funding (respondents could select multiple benefits).

Table 2. Example evaluation metrics. Performance: maximum memory usage (Eisty et al. 2018); average time-to-complete of algorithmic steps (Eisty et al. 2018); requirements analysis; tuning. A variety of metrics can be used to attempt to interpret usefulness, reliability, uptake by the community, and more. Here, we describe metrics used by the authors of the paper. See Lenarduzzi et al. (2020), Eisty et al. (2018), and Thelwall and Kousha (2016) for more information about metrics used by others.
Thus, additional support for evaluations of software usage and impact could greatly benefit the continued development of software.
Responses to an open-ended question asking "Is there anything you would like to measure but have been unable to capture?" included (each of these examples was a unique response): collaborations that the tool supported, the number of commercial applications using the tool, the fraction of the assumed user base that actually uses the tool, the downstream activity (what users do with the results), and user frustration. These responses outline many of the challenges that developers often face. See Supplemental Table S1 and Supplemental Note S2 for examples of the goals of the respondents.
We also manually inspected 44 scientific research tools: 33 funded by ITCR alone, seven funded by the Cancer Target Discovery and Development (CTD²) Network (Aksoy et al. 2017), and four funded by both. Each was inspected for infrastructure that could help users learn about the tool or how to use it, as well as infrastructure related to software health metrics that indicate how recently the code was rebuilt or tested (Srivastava and Schumann 2011). We then investigated whether there were any associations between these aspects and usage. A variety of different types of research-related tools or resources were inspected (see Table 3). Each tool or resource was manually inspected (by someone not involved in developing these tools), to approximate the experience of a potential user briefly examining related websites, to determine whether the tool had: a DOI for the software itself, information on how to cite the software, information on how to contact the developers, documentation (and how much), an X/Twitter presence, and badges about software health metrics (such as those related to maintenance and testing) (Srivastava and Schumann 2011) visible on a related website.
To evaluate a proxy for usage, we used the SoftwareKG-PMC database (Krüger and Schindler 2020), which does not include citations to tools, only plain-text mentions inferred by a text-mining algorithm. This enabled us to capture cases where users mentioned but did not necessarily cite a tool. Importantly, mentions also do not always indicate usage. The database does not know anything about these tools per se, and not all mentions necessarily correspond to the same tool. For example, DANA is an ITCR tool for microRNA analysis, but other tools share the same name. Although time since the tool's release was the largest contributor to variation in the number of papers describing usage, various aspects of infrastructure that could help users learn about a tool (a social media presence on X/Twitter), have confidence in the tool (badges about software builds or tests), or learn how to use the tool (extensive documentation and feedback mechanisms) all appeared to be associated with an increased rate of manuscripts describing use of the tool. All show a significant association (P < 0.05) with usage when not accounting for tool age; only having extensive feedback mechanisms remained significantly associated when also accounting for tool age (see Fig. 1). For more information about this analysis, see our website https://hutchdatascience.org/ITCR_Metrics_manuscript_website/.
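As an illustration of the kind of adjusted versus unadjusted comparison described above, the sketch below shows one minimal way such a model could be expressed. It is not the authors' analysis code; the file name and column names (tool_infrastructure.csv, mention_count, has_feedback_mechanisms, release_year) are hypothetical placeholders.

```python
# Minimal sketch: relate an infrastructure feature to log mention counts,
# with and without adjusting for tool age. Illustrative only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

tools = pd.read_csv("tool_infrastructure.csv")  # hypothetical: one row per tool

# Log-transform the SoftwareKG-PMC mention counts (as in Fig. 1); add 1 to
# handle tools with zero recorded mentions.
tools["log_mentions"] = np.log10(tools["mention_count"] + 1)

# Unadjusted association between one infrastructure feature and mentions.
unadjusted = smf.ols("log_mentions ~ has_feedback_mechanisms", data=tools).fit()

# The same feature after accounting for tool age (year of release).
adjusted = smf.ols("log_mentions ~ has_feedback_mechanisms + release_year",
                   data=tools).fit()

print(unadjusted.summary())
print(adjusted.summary())
```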

Results
The results of our evaluation of scientific software suggest that infrastructure can support the collection of more metrics and support more mentions of software in papers. Specifically, our results showed that active social media, more in-depth documentation, clear methods to contact developers, and software health metrics [metrics related to how often the software is tested, developed, etc. (Srivastava and Schumann 2011)] appear to enhance mentions in papers.
The infrastructure described in Table 4 and Supplemental Note S3 could enable more comprehensive metrics and insights regarding software usage and impact. Funders and developers should consider these elements when assessing the impact and new directions of a project.

Discussion
With the new metrics collected through the software infrastructure described in Table 4 comes a new host of challenges that require guidance. Here, we lay out how the metrics collected from the infrastructure discussed in the previous section should be handled appropriately. The following guidance is based on the composite experience of the authors:

Successful evaluations are anchored by an understanding of the intended use of the software
The intended goal or purpose of scientific software should inform how the software is evaluated (Basili et al. 1994). Computational tools are designed to support well-defined goals, often called use cases (Gamma et al. 1995), for specific sets of users, called personas (Cooper 2004). Efforts to evaluate the impact of tools should be guided by a clear understanding of these use cases and personas to assess how well the tools meet the intended goals for all intended users.

Metric selection should be hypothesis driven
Collecting an exhaustive amount of user data before selecting metrics can increase the risk that metrics are selected in a biased manner. This can lead to picking metrics that look good but are not necessarily as meaningful to the intended usage of the tool. To mitigate this, metrics can be selected ahead of time based on a specific hypothesis to ultimately evaluate how well the software supports its intended goals (Mullen 2020).

No single evaluation method works for every type of software
No individual scheme for collecting metrics fits every type of software tool. The meaning of a set of metrics may differ across contexts. For example, the location of a tool (e.g. on the web or downloaded) can influence user access to software versions and how one might collect metrics. For a web-based application, users will rarely have access to older versions. Thus, developers can roll out version updates and collect metrics with clarity about how usage changed. For locally run tools, users may be using older, previously downloaded versions. Additionally, tools that are installed on institutional servers have much smaller installation counts than those installed on individual computers. No single metric is one-size-fits-all, and how each software tool should be evaluated must be thoughtfully planned.

Metrics require interpretation
Metric interpretation is rarely straightforward. A spike may correspond to a workshop using the tool or a recent publication citing it. Negative trends may indicate a break in the academic calendar, holidays, downtime of a host server, or software bugs. It is also important to avoid comparisons between metrics for tools with different users and contexts. Total unique downloads might indicate software popularity, but they do not tell us whether users found the software useful. Instead, metrics about repeat usage by the same users or the number of launches of the software over a certain predefined session time threshold may better evaluate actual usage. For tools that offer access to or analyses of different data types, one may want to parse usage by data type to evaluate how well the tool appears to support different kinds of users. Specific measures can provide a common basis for comparing versions and, potentially, for comparisons against other similar software.
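To make the returning-user and session ideas above concrete, here is a minimal sketch of how such counts could be derived from a usage log; the log format, column names, and the 30-minute and 5-minute thresholds are assumptions for illustration, not a prescribed standard.

```python
# Illustrative sketch: returning users and "engaged" sessions from a hypothetical log.
import pandas as pd

events = pd.read_csv("usage_log.csv", parse_dates=["timestamp"])  # assumed columns: user_id, timestamp

# Returning users: users active on more than one distinct calendar day.
days_per_user = (events.assign(day=events["timestamp"].dt.date)
                 .groupby("user_id")["day"].nunique())
returning_users = int((days_per_user > 1).sum())

# Sessions: consecutive events by the same user separated by less than 30 minutes
# belong to one session; count only sessions longer than a 5-minute threshold.
events = events.sort_values(["user_id", "timestamp"])
new_session = (events.groupby("user_id")["timestamp"].diff()
               > pd.Timedelta(minutes=30)).astype(int)
events["session_id"] = new_session.groupby(events["user_id"]).cumsum()
session_lengths = (events.groupby(["user_id", "session_id"])["timestamp"]
                   .agg(lambda t: t.max() - t.min()))
engaged_sessions = int((session_lengths > pd.Timedelta(minutes=5)).sum())

print(f"Returning users: {returning_users}, engaged sessions: {engaged_sessions}")
```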

Metrics of best practices provide indicators of software health
Tracking adherence to best practices of software engineering can be a useful way to assess software project health (Srivastava and Schumann 2011), including the use of version control systems, high coverage of code with testing, and the use of automated or continuous integration. None of these measures of project health is perfect (and each can be done poorly), but collectively they can indicate software health. Including badges for such indicators on code repositories and websites can give users and others confidence. Some software packages can help automatically assess package health, such as the riskmetric package (https://pharmar.github.io/riskmetric/) for the evaluation of R packages (R Validation Hub et al. 2024). Additional detail on these topics can be found in The Pragmatic Programmer (Thomas and Hunt 2019). See Table 5 and Supplemental Note S4 for suggestions.
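For illustration, a few coarse health signals of this kind can be pulled from public code-hosting APIs. The sketch below uses the public GitHub REST API, and the repository name shown is purely hypothetical; dedicated tooling such as riskmetric or badge services provides richer, validated views of the same information.

```python
# Rough sketch: crude project-health signals from the public GitHub REST API.
import requests

def repo_health(owner: str, repo: str) -> dict:
    """Return a few coarse health indicators for a GitHub repository."""
    resp = requests.get(f"https://api.github.com/repos/{owner}/{repo}", timeout=10)
    resp.raise_for_status()
    data = resp.json()
    return {
        "last_push": data["pushed_at"],            # recency of development activity
        "open_issues": data["open_issues_count"],  # includes open pull requests
        "stars": data["stargazers_count"],
        "license": (data.get("license") or {}).get("spdx_id"),
    }

# Hypothetical repository used purely for illustration.
print(repo_health("example-org", "example-tool"))
```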
Metrics related to software quality and reusability could reassure users and funders

Software reusability metrics have been suggested to enable better discernment of the capacity for code to be reused in other contexts. These metrics can also evaluate whether code is written to be more resilient over time to dependency changes and other maintenance challenges. One example would be the degree to which aspects of the software are independent of one another (Mehboob et al. 2021). As research funders start to value software maintenance more, metrics related to resilience and reusability may become more valuable. Other similar metrics related to maintainability have been used in the software community for some time, relying on measures such as the number of code comments, lines of code, or code complexity metrics (Wang 2006), but open source software projects with community contributors can make aspects related to software maintainability a challenge (Oman and Hagemeister 1994, Welker 2001, Ganpati et al. 2012).
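As a toy example of one of the simple maintainability proxies mentioned above (the ratio of comment lines to code lines), the following sketch scans a source tree; the directory path is a placeholder, and a real assessment would combine this with complexity and testing metrics.

```python
# Minimal sketch: comment-to-code ratio across a Python source tree.
from pathlib import Path

def comment_ratio(source_dir: str) -> float:
    code_lines = comment_lines = 0
    for path in Path(source_dir).rglob("*.py"):
        for line in path.read_text(errors="ignore").splitlines():
            stripped = line.strip()
            if not stripped:
                continue            # ignore blank lines
            if stripped.startswith("#"):
                comment_lines += 1  # full-line comments only, a deliberate simplification
            else:
                code_lines += 1
    return comment_lines / max(code_lines, 1)

print(f"Comment-to-code ratio: {comment_ratio('src'):.2f}")  # 'src' is a placeholder path
```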

Challenges and nuances
Here, we outline a number of challenges and nuances associated with evaluating metrics for software usage and impact.

Distorted metrics
Projects like the ITCR-funded Bioconductor (Huber et al. 2015), with a large variety of software packages, offer an opportunity to assess distortion of metrics by evaluating how different packages are used over time, revealing important nuances (see Table 6). Major themes include accidental usage by scripts that inadvertently loop through downloading a piece of software many times; usage of software to support other software for technical reasons; unexpected patterns of persistent use after a tool is, in theory, no longer as useful, believed to be due to downloads on servers using lists of historically typical packages; and background levels of usage, with low levels of downloads even for tools that are no longer supported.

Clinical data challenges
Clinical data often contain protected health information (PHI). Thus, the number of individuals who have access to the data is generally smaller. Many tools containing clinical data are also run at an enterprise level (such as the ITCR-funded tool EMERSE), meaning they are installed only once by system administrators and accounts are provisioned to users. Thus, counting installations does not represent the overall use. Further, security mechanisms to protect clinical data inhibit developers from accessing the installed systems themselves. Ultimately, because downloads for clinical tools typically occur at an institutional level, metrics around software downloads underestimate their impact. It would not be realistic to compare the usage metrics of such tools to those of more widely available and accessible tools.

Goodhart's law
Goodhart's law states that "every measure which becomes a target becomes a bad measure" (Hoskin 1996). For example, h-indices (the number of papers an author has with that many or more citations) are used to assess an author's impact. As the h-index grew in popularity, the number of researchers included as coauthors, the number of citations per paper, and the fraction of self-citations all increased, each leading to an inflated h-index. Although metrics could be used to bring about best practices for binary outcomes (i.e. public deposition of code), for more quantitative metrics (e.g. number of downloads) the results could easily become meaningless. The effect behind this concept cannot be entirely avoided because of fundamentals of human behaviour, but ways to minimize it are to continue to evaluate metrics over time, to consider whether our metrics are truly measuring what we think they are, to consider whether our metrics are actually fair to a diverse range of projects, and to consider new metrics as needed (Fire and Guestrin 2019). Funding agencies need to consider how each type of tool is context-dependent, and that impact should be compared between similar classes of tools.
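For readers unfamiliar with the h-index definition given above, a short worked example:

```python
# Worked example of the h-index: the largest h such that the author has
# h papers with at least h citations each.
def h_index(citations: list[int]) -> int:
    ranked = sorted(citations, reverse=True)
    h = 0
    for i, c in enumerate(ranked, start=1):
        if c >= i:
            h = i
        else:
            break
    return h

print(h_index([25, 8, 5, 3, 3, 1]))  # prints 3: three papers have at least 3 citations each
```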

Security, legal and ethical considerations
With phone-home software (the collection of information from the computers of users who downloaded or installed a particular piece of software) or web-based analytics, users are often tracked for specific usage. Occasionally software developers will notify users that they are being tracked; however, this is often not required. The General Data Protection Regulation (GDPR), implemented in 2018, requires that organizations anywhere in the world respect certain data collection obligations regarding people in the European Union. It is intended to protect the data privacy of individuals and mostly guards against the collection of identifiable personal information. Data collection of software usage needs to be mindful of the GDPR and any other international regulations. As science is a particularly international pursuit, users often reside outside the country where the tool was developed.
One way to mitigate this is to let users choose whether they wish to be tracked. Developers can also design tracking to be more anonymous. A genome visualization tool may track the number of unique uses, but not track what part of the genome was visualized [as is the case for the UCSC Xena Genome Browser (Goldman et al. 2020)]. Google Analytics (https://marketingplatform.google.com/about/analytics/) provides support to mask the unique IP addresses of visitors to a website tracked by the system. Ethical and legal consequences should be considered when designing or implementing tracking systems for software (see Supplemental Note S5 for more information).
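As a sketch of what more anonymous tracking can look like in practice, the example below truncates IP addresses and pseudonymizes user identifiers before they are stored; the specific truncation rule and salt handling are simplified assumptions, not a compliance recipe.

```python
# Illustrative sketch: coarsen identifiers before they are logged.
import hashlib
import os

def truncate_ipv4(ip: str) -> str:
    """Zero the last octet so the stored address identifies a network, not a person."""
    octets = ip.split(".")
    octets[-1] = "0"
    return ".".join(octets)

def pseudonymize(user_id: str, salt: bytes) -> str:
    """Replace a user identifier with a salted hash before logging."""
    return hashlib.sha256(salt + user_id.encode()).hexdigest()

salt = os.urandom(16)  # kept server-side; rotating it breaks linkage across periods
print(truncate_ipv4("192.0.2.41"))            # -> 192.0.2.0
print(pseudonymize("user@example.org", salt))
```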

Conclusions
Our assessments indicate that cancer software developers of the ITCR find it difficult to find the time or funding to evaluate the impact and usage of their software using metrics, despite their awareness of the benefits. Many have found such evaluations useful for driving future development and obtaining additional funding. A sizable portion (27%) of those surveyed self-reported not knowing what methods to use for such evaluations. We also find from our manual evaluation of a subset of scientific software tools that tools appear to be more widely used when developers provide deeper documentation, badges about software health metrics, and more in-depth contact information, as well as having an X/Twitter presence. It is not clear why this is. It may be that those who have the time and support to more thoroughly document and advertise their tools also have more resources to develop the tool itself, lending to wider usage. However, it may also be that a social media presence brings new users to tools and that the other infrastructure (badges, deeper documentation, etc.) helps new users trust the software. Further studies are necessary to understand these patterns. Nonetheless, this suggests that supporting developers to spend more time on such elements could drive further usage of existing tools. We hope that funding agencies will value supporting developers to evaluate, promote, and maintain existing tools in addition to the current typical model, in which most agencies prioritize the creation of new tools. A recent article (Merow et al. 2023) suggested that a new type of manuscript for software updates may help the field better reward maintenance of existing software. We argue that evaluations of software impact and usage could also be incorporated into such a model for software-related manuscripts.
While metric collection beyond traditional citations is only one piece of the software development workflow, we feel that it has been underappreciated by funding institutions and promotion committees. In addition, while common metrics may be valuable for comparisons of similar types of tools, other types of metrics may give more insight into the downstream impact of a tool in terms of what development and advancements in the field the software supported. For example, perhaps we should consider how much a software tool inspires the development of other tools, or the value of the papers that cite a tool (perhaps by citation rate, measures of innovation, or measures of clinical impact, such as clinical trials).

Table 6. Examples of metric distortion.

Accidental usage: Occasionally, scripts used on servers may inadvertently download a package repeatedly and rapidly, hundreds to thousands of times, resulting in distorted download metrics that are not representative of real usage. Unique IP download information is useful to distinguish between one user downloading many times versus many users downloading a few times. Given privacy concerns, an alternative solution could involve tracking the timing and approximate location of downloads with a threshold for what would be more than expected as maximum real usage, such as a group of people following a tutorial (a minimal sketch of this thresholding idea follows after the table).

Background usage: There is a baseline background level of downloads across all packages in Bioconductor (including those that are no longer supported). Thus, if a new package has 250 downloads in the first year, this may seem like a successful number, but it is actually similar to background levels.

Technical versus research usage: It can be difficult to discern whether the usage of a package is for scientific research itself or for supporting the implementation of other software. While both are arguably valuable, distinguishing between these motivations can help us understand a particular software's impact in a field. For example, the S4Vectors package (10.18129/B9.bioc.S4Vectors) (Pagès et al. 2022) is an infrastructure package used by many other packages for technical and non-biological reasons and is therefore not often directly downloaded by end-users. This package is also included in automated checks for other Bioconductor packages using GitHub Actions. Another example of supporting implementation is in the context of container image use. Containerization software [like Docker (https://www.docker.com/) and Singularity] often installs software packages for individual environments, which can inflate usage statistics. For instance, a user who is actively developing a container may re-trigger the build, and thus the installation of associated software, many times over the course of a project.

Usage persistence: The affy package (10.18129/B9.bioc.affy) (Gautier et al. 2004) was one of the early packages for microarray analysis, a technology that has largely been replaced by newer technologies, as can be seen by the rate of microarray submissions to GEO over time. However, despite the field transitioning away from microarray methods (Mantione et al. 2014), the package was downloaded in 2021 at rates double those of 2011. The authors speculate that this could be because people historically requested that affy be installed on servers and this persists, or perhaps it is being used for preliminary hypothesis testing on existing microarray data, or because other microarray packages are no longer supported.

Here, we provide more in-depth information about metric distortion themes identified by evaluating tools in Bioconductor (which is ITCR-funded). GEO = Gene Expression Omnibus.
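The accidental-usage theme above suggests flagging download bursts that exceed a plausibility threshold. The sketch below illustrates that idea; the log format, column names, and the threshold value are hypothetical.

```python
# Illustrative sketch: flag source-days whose download counts exceed plausible real usage.
import pandas as pd

MAX_PLAUSIBLE_PER_DAY = 50  # assumed ceiling for genuine interactive usage

downloads = pd.read_csv("download_log.csv", parse_dates=["timestamp"])  # assumed columns: source_id, timestamp
daily = (downloads
         .assign(day=downloads["timestamp"].dt.date)
         .groupby(["source_id", "day"])
         .size()
         .rename("count")
         .reset_index())

suspect = daily[daily["count"] > MAX_PLAUSIBLE_PER_DAY]
print(f"{len(suspect)} source-days look like scripted, repeated downloads")
print(f"{daily['count'].sum() - suspect['count'].sum()} downloads remain after excluding them")
```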
Certainly, as scientific software continues to be critical for scientific and medical advancement, we should continue to think beyond the software citation model and consider the infrastructure and metrics discussed here as we determine how to support scientific software developers in the future.

Figure 1. Aspects of software infrastructure appear to be associated with a larger number of published manuscripts from users describing usage of the software in the SoftwareKG-PMC database. The X-axis indicates the age of the software by showing the year it was released. The Y-axis indicates the log of the total number of papers describing usage of the software in the SoftwareKG-PMC database. See the Supplementary material and our website for more information.

Table 1. Needs, goals, and benefits of software evaluation.

Table 3. Scientific tools and resources evaluated.
Here, we show the variety among the 44 ITCR and CTD² scientific research tools/resources evaluated for various characteristics by manual inspection of infrastructure used to support software evaluation metrics beyond software paper citations.

Table 4. Software infrastructure can enable the capture of valuable metrics for evaluating engagement and impact.

Table 5. Software health infrastructure.